Measuring informative content of words in documents in a document collection relative to a probability function including a concavity control parameter

ABSTRACT

Processing methods and systems are provided for representing documents relative to importance of words in the document. A processor comprising a weighting model of word importance in a document in a collection relative to an importance of the word in other documents in the collection computes a deviation of distribution of the word from a probability distribution of the word in other documents in the collection, where the deviation distribution is weighted in accordance with a concavity control function. A concavity control parameter is adjustable relative to word frequency.

BACKGROUND

Many tasks such as Information Retrieval, Clustering and Categorization represent documents by vectors, where each dimension index of a document can represent a given word and where a value can encode the word importance in the document. The field of the subject embodiments relates to a representation of documents by the relative importance of a word in a document and to a particular weighting model relative to term frequencies in an associated collection. More particularly, the embodiments relate to weighting the measure of relative importance using a concavity control parameter.

Many weighting models try to quantify the importance of a word in a document with probabilistic measures. For each term in the collection, a probability distribution is associated with the term. For any word in any document, a probability according to the collection model can then be computed. The computed probability is related to the informative content of the word in the document or document collection (corpus).

The main hypothesis of many weighting models is the following: the more the distribution of a word in a document deviates from its average distribution in the collection, the more likely this word is to be significant for the document considered. This can be easily captured in terms of Shannon information. Let X_(ω) be the random variable of frequencies of word ω and P a probability distribution with parameter λ_(ω); then the Shannon information measure is:

Info_(P)(X_(ω)=x_(ω)) = −log P(X=x_(ω)|λ_(ω)) = Informative Content   (1)

If a word behaves in the document as expected in the collection, then it has a high probability P(X=x_(ω)|λ_(ω)) of occurrence in the document, according to the collection distribution, and the information it brings to the document, −log P(X=x_(ω)|λ_(ω)), is small. Conversely, if it has a low probability of occurrence in the document, according to the collection distribution, then the amount of information it conveys is greater.
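As a minimal numeric illustration of this relationship, the following sketch evaluates −log p for a likely and an unlikely observation. The function name `shannon_information` and the two probability values are illustrative assumptions and do not come from the disclosure.

```python
import math


def shannon_information(p: float) -> float:
    """Return -log(p), the Shannon information of an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p)


# Hypothetical probabilities under the collection model.
likely = shannon_information(0.9)     # word behaves as expected: little information
unlikely = shannon_information(0.01)  # word is surprising: much more information
```

The smaller the probability assigned by the collection model, the larger the returned value, which is the behavior the weighting models rely on.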

This is the idea underpinning the classical Divergence From Randomness Model and information-based models in Information Retrieval.

Hence, the cornerstone of many weighting models consists in using Shannon information −log P(X_(ω)|λ_(ω)) to measure the importance of a word in a document and weighting of words in documents.

Table 1 identifies many of the notations used in the remainder of this disclosure.

TABLE 1: Notations

Notation      Description
q, d          Original query, document
RSV(q, d)     Retrieval status value of d for q (i.e., ranking function)
x_(wd)        Number of occurrences of w in document d
X_(w)         Discrete random variable for the x_(wd)
T_(w)         Continuous random variable for normalized occurrences
l_(d)         Length of document d
avgl          Average document length in the collection
N             Number of documents in the collection
N_(w)         Number of documents containing w
idf(w)        −log(N_(w)/N)
P(w|C)        Probability of the word in the collection

A known formulation that defines the family of information-based IR models is the following equation:

RSV(q, d) = Σ_(w∈q) −q_(w) log P(T_(w) ≧ t_(wd)|λ_(w))   (2)

where T_(w) is a continuous random variable modeling normalized term frequencies and λ_(w) is a set of parameters of the probability distribution considered. This ranking function corresponds to the mean information a document brings to a query or, equivalently, to the average of the document information brought by each query term.

A few words are needed to explain the choice of the probability P(T_(w)≧t_(wd)) in the information measure. Shannon information was originally defined on discrete probability distributions, and the quantity of information from the observation of x was measured with −logP(X=x|Θ). As the normalized frequencies t_(wd) are continuous variables, Shannon information cannot be directly applied.

A known solution is to measure information on a probability of the form P(t_(wd)−a ≦ T_(w) ≦ t_(wd)+b|λ_(w)). However, one has to choose values for a and b, and a=0 and b=+∞ have been chosen for theoretical reasons. The mean frequency of most words in a document is close to 0. For any word, large frequencies are typically less likely than smaller frequencies on average. The larger the term frequency is, the smaller P(T_(w)≧t_(wd)) is and the bigger −logP(T_(w)≧t_(wd)) is. Hence, the use of the survival function P(T>t) seems compatible with the notion of information content discussed above.

Overall, the general idea of the information-based family is the following:

1. Because document lengths differ, discrete term frequencies (x) are renormalized into continuous values (t(x)).

2. For each term w, one can assume that those renormalized values follow a probability distribution P on the corpus. Formally, T_(w) ∼ P(.|λ_(w)).

3. Queries and documents are compared through a measure of surprise, or a mean of information, of the form

${{RSV}\left( {q,d} \right)} = {\underset{w \in q}{\Sigma} - {q_{w}\log \; {P\left( {T_{w} \geq {t(x)}} \middle| \lambda_{w} \right)}}}$

So information models are specified by two main components: a function which normalizes term frequencies across documents, and a probability distribution modeling the normalized term frequencies. Information is the key ingredient of such models since information measures the significance of a word in a document.
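The two components and the ranking function can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the disclosure: it borrows the length normalization t = x·log(1 + c·avgl/l_d) and the log-logistic survival function λ/(λ + t) that appear later in this description, and the function names, the query weights, and the λ values are hypothetical.

```python
import math


def normalize_tf(x_wd: float, doc_len: float, avg_len: float,
                 c: float = math.e - 1.0) -> float:
    """Step 1: rescale a raw count by document length."""
    return x_wd * math.log(1.0 + c * avg_len / doc_len)


def survival_log_logistic(t: float, lam: float) -> float:
    """Step 2: P(T >= t | lambda) = lambda / (lambda + t)."""
    return lam / (lam + t)


def rsv(query_weights, doc_counts, doc_len, avg_len, lambdas):
    """Step 3: sum over query words of -q_w * log P(T_w >= t_wd | lambda_w)."""
    score = 0.0
    for w, q_w in query_weights.items():
        # A word absent from the document has t = 0, P = 1, and adds nothing.
        t = normalize_tf(doc_counts.get(w, 0), doc_len, avg_len)
        score += -q_w * math.log(survival_log_logistic(t, lambdas[w]))
    return score


# Hypothetical query, collection parameters, and documents of average length.
query = {"concavity": 1.0, "retrieval": 1.0}
lambdas = {"concavity": 0.01, "retrieval": 0.2}
score_a = rsv(query, {"concavity": 3, "retrieval": 1}, 100, 100, lambdas)
score_b = rsv(query, {"retrieval": 1}, 100, 100, lambdas)
```

In this toy setting the document that also contains the rarer query word receives the higher score, since that word's occurrences are the more surprising ones under the collection model.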

Such known information models measure the relative importance of a word in a document compared to its importance in other documents in the collection by either a fixed weighting function (natural log) or the proposition of ad-hoc classes (which do not focus on concavity control).

There is always a need for new and improved representation of documents which can yield better results than known representations in document retrieval, categorization and clustering tasks. The subject embodiments address this need.

SUMMARY OF THE PREFERRED EMBODIMENTS

The subject embodiments relate to the use of a generalized logarithmic function including a concavity control parameter to measure the importance of words in documents.

In one preferred embodiment a method is provided for document representation by measuring informative content of words comprising associating a probability distribution for a plurality of words in a document collection;

forming a weighting model of word importance to the collection from the probability distribution therein;

determining a distribution of a selected word in a particular document in the collection;

computing a deviation of distribution of the selected word in the document from the probability distribution of the word in the collection in accordance with a concavity function; and

assigning a greater importance of the selected word to the informative content of the document the greater the computed deviation, whereby the document can be represented relative to the collection by the assigned importance of a plurality of selected words.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart showing steps of a preferred embodiment;

FIG. 2 is a graph of an η-logarithm illustrating concavity control; and,

FIG. 3 is a graph showing a Mean Average Precision for dataset queries once the SVD has been performed for several latent dimensions.

DETAILED DESCRIPTION

The preferred embodiments use a generalized logarithmic function including a concavity control parameter to measure informative contents of documents.

Instead of measuring informative content with a fixed log function, −logP(T>t), the subject embodiments measure the informative content of a word in a document with a function having a variable concavity control parameter, −log_(η)P(T>t), where η sets the curvature of the logarithm.

So, the feature vector of a document can be seen as:

φ(d) = (−log_(η) P(T_(w1)|λ_(w1)), . . . , −log_(η) P(T_(wM)|λ_(wM)))   (5)

The η-logarithm [5] comes from the statistical physics community and can be understood as a generalization of the logarithm function. The η-logarithm is defined for all t>0 by:

$\begin{matrix} {{\ln_{\eta}(t)} = {\frac{1}{1 - \eta}\left( {t^{1 - \eta} - 1} \right)}} & (6) \end{matrix}$

The interesting properties of this curved logarithm are:

ln_(η)(1) = 0 $\frac{\partial{\ln_{\eta}(t)}}{\partial t} = {\frac{1}{t^{\eta}} = t^{- \eta}}$

-   -   η=1 leads to the familiar log function.

FIG. 2 shows the graph of different η-logarithms including a concavity control parameter to illustrate how variability of the parameter affects concavity of the function.
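The definition in equation (6) can be sketched in a few lines. The function name `eta_log` and the special-casing of η close to 1 (where equation (6) would divide by zero and the natural logarithm is used as the limiting value) are choices made for this illustration only.

```python
import math


def eta_log(t: float, eta: float) -> float:
    """Generalized logarithm ln_eta(t) = (t**(1 - eta) - 1) / (1 - eta), t > 0."""
    if t <= 0.0:
        raise ValueError("t must be positive")
    if math.isclose(eta, 1.0):
        return math.log(t)  # limiting case eta -> 1
    return (t ** (1.0 - eta) - 1.0) / (1.0 - eta)


# Values at t = 0.5 for a few curvatures; eta = 1 is the ordinary logarithm.
values = {eta: eta_log(0.5, eta) for eta in (0.5, 1.0, 1.5, 2.0)}
```

For arguments below 1, such as probabilities, a larger η yields a more negative value, so −ln_η of a small probability grows faster as η increases.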

The subject embodiments are motivated by the following points. In information retrieval tasks, the concavity in term frequencies of the retrieval function is a critical property of an IR model. For example, concavity favors documents that contain different query terms (i.e. aspects) against documents that contain one query term. A priori, there is no intrinsic reason for the log function to have the best analytical properties in an IR setting. Hence, the point of this generalization is to adjust the concavity of weighting functions. In fact, the −log_(η) can parameterize a family of concave functions which will lead to better performance.

Although there is gain in expressive power with the introduction of the parameter η, there is also a loss in the interpretation in terms of probability. With the log function it is possible to write the information measure a document conveys as:

−log P(T^(d) = (t_(w1), . . . , t_(wM))) = −log Π_(w) P(T_(w)^(d) > t_(w)) = −Σ_(w) log P(T_(w)^(d) > t_(w))   (7)

Note that the previous equation assumes independence between different words. With the η-logarithm, such an interpretation is no longer possible, because the η-logarithm of a product is in general not the sum of the η-logarithms of its factors.
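The loss of additivity can be made explicit with a short supplementary derivation from definition (6); it is offered here as an illustration and is not part of the original disclosure. For any x, y > 0:

$\ln_{\eta}(xy) = \ln_{\eta}(x) + \ln_{\eta}(y) + (1 - \eta)\,\ln_{\eta}(x)\,\ln_{\eta}(y)$

The cross term vanishes for all x and y only in the limit η→1, which is the case of the ordinary logarithm used in equation (7).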

One can define a generalized information model of the form:

RSV(q, d) = Σ_(w∈q∩d) −q_(w) ln_(η) P(T_(w) ≧ t_(wd)|λ_(w))   (8)

This model, denoted QLN, then has two parameters:

-   -   c which normalizes the term frequencies

$\left( {t_{wd} = {x_{wd}{\log \left( {1 + {c\frac{avgl}{l_{d}}}} \right)}}} \right)$

-   -   η which sets the curvature of the logarithm

The benefit of such a model is the ability to vary the curvature of the model in order to assess the role of concavity. Analysis of the model concavity includes an analysis of the weighting function h=−ln_(η)P(T_(w)≧t_(wd)|λ_(w)) of this model with a log-logistic distribution, for which P(T_(w)≧t_(wd)|λ_(w))=λ_(w)/(λ_(w)+t_(wd)).
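A minimal sketch of the QLN scoring of equation (8) with the log-logistic survival function follows. The function names, the toy query and documents, and the λ values are hypothetical and chosen only to show the effect of η; they are not taken from the disclosure.

```python
import math


def eta_log(t: float, eta: float) -> float:
    """ln_eta(t) = (t**(1 - eta) - 1) / (1 - eta); natural log as eta -> 1."""
    if math.isclose(eta, 1.0):
        return math.log(t)
    return (t ** (1.0 - eta) - 1.0) / (1.0 - eta)


def qln_weight(x_wd, doc_len, avg_len, lam, eta, c=math.e - 1.0):
    """h = -ln_eta P(T_w >= t_wd | lambda_w) with P = lambda / (lambda + t)."""
    t_wd = x_wd * math.log(1.0 + c * avg_len / doc_len)
    return -eta_log(lam / (lam + t_wd), eta)


def qln_rsv(query, doc, doc_len, avg_len, lambdas, eta, c=math.e - 1.0):
    """RSV(q, d): weighted sum over words present in both query and document."""
    return sum(
        q_w * qln_weight(doc[w], doc_len, avg_len, lambdas[w], eta, c)
        for w, q_w in query.items()
        if w in doc
    )


# Hypothetical data: same total count, spread over one or two query words.
query = {"a": 1.0, "b": 1.0}
lambdas = {"a": 0.1, "b": 0.1}
one_term = {"a": 4}
two_terms = {"a": 2, "b": 2}
scores = {
    eta: (qln_rsv(query, one_term, 100, 100, lambdas, eta),
          qln_rsv(query, two_terms, 100, 100, lambdas, eta))
    for eta in (0.5, 1.0, 1.5, 2.0)
}
```

With equal λ for both hypothetical query words, the document covering both words scores higher for the values of η below 2 tried here, while at η=2 the two documents tie. This is one way to observe the role that concavity plays in favoring documents that cover several query terms.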

$\begin{matrix} {{h\left( {t,\lambda,\eta} \right)} = {{- \frac{1}{1 - \eta}}\left( {\left( \frac{\lambda}{\lambda + t_{wd}} \right)^{1 - \eta} - 1} \right)}} & (9) \\ {\frac{\partial h}{\partial t} = {{- \frac{1 - \eta}{1 - \eta}}\left( \frac{\lambda}{\lambda + t_{wd}} \right)^{- \eta}\left( {- \frac{1}{\left( {\lambda + t_{wd}} \right)^{2}}} \right)}} & (10) \\ {\frac{\partial h}{\partial t} = \frac{\left( {\lambda + t_{wd}} \right)^{\eta - 2}}{\lambda^{\eta}}} & (11) \\ {\frac{\partial^{2}h}{\partial t^{2}} = {\left( {\eta - 2} \right)\frac{\left( {\lambda + t_{wd}} \right)^{\eta - 3}}{\lambda^{\eta}}}} & (12) \end{matrix}$

Within this family of IR models, it is possible to obtain concave models. Heuristic conditions in Information Retrieval suggest that the IR weighting model should be concave in term frequencies. Since the second derivative (12) has the sign of η−2, these conditions suggest that η must be less than 2. The case where η=2 is the case where the model is linear:

$\begin{matrix} {h = {\frac{\lambda + t_{wd}}{\lambda} - 1}} & (13) \end{matrix}$

Note that whenever η<1, the weighting function h is bounded:

$\begin{matrix} {h = {{\frac{1}{1 - \eta}\left( {1 - \left( \frac{\lambda}{\lambda + t_{wd}} \right)^{1 - \eta}} \right)} \leq \frac{1}{1 - \eta}}} & (14) \end{matrix}$
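These properties of h can be checked numerically. The sketch below is an illustration only: it uses a central finite difference in place of the analytic second derivative, and the function names, the step size, and the test point (t=2, λ=0.5) are arbitrary assumptions.

```python
import math


def h(t: float, lam: float, eta: float) -> float:
    """Weighting function of equation (9): -ln_eta(lam / (lam + t))."""
    p = lam / (lam + t)
    if math.isclose(eta, 1.0):
        return -math.log(p)  # limiting case eta -> 1
    return -(p ** (1.0 - eta) - 1.0) / (1.0 - eta)


def second_difference(t: float, lam: float, eta: float, step: float = 1e-3) -> float:
    """Central finite-difference estimate of the second derivative of h in t."""
    return (h(t + step, lam, eta) - 2.0 * h(t, lam, eta)
            + h(t - step, lam, eta)) / step ** 2


# Curvature at an arbitrary point for several values of eta.
curvature = {eta: second_difference(2.0, 0.5, eta) for eta in (0.5, 1.0, 1.5, 2.0, 3.0)}

# For eta < 1 the weight saturates below 1 / (1 - eta), even for a huge frequency.
saturated = h(1e9, 0.5, 0.5)
```

The estimated curvature is negative for η<2, approximately zero at η=2, and positive for η>2, and the saturated value stays below 1/(1−η)=2 for η=0.5.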

Testing has shown that the generalized information model outperforms known baselines. So, changing the curvature of the models allows one to obtain significant improvements over prior known fixed weighting functions (natural log) or ad-hoc classes which do not focus on concavity control.

Latent Semantic Indexing (LSI) [3] computes a Singular Value Decomposition over a term-document matrix in order to find latent topics. It is also an approach to perform document clustering.

For example, several variants of LSI can be tried on the CLEF (Cross-Language Evaluation Forum) dataset. The term-document matrix is weighted with a TF-IDF baseline, or with an informative content with log-logistic distribution as in information models, or, in accordance with the subject embodiment, with the generalized informative content. More formally, the weighted term-document matrix I_(wd) is defined as follows:

$\begin{matrix} {I_{wd} = {{- \log_{\eta}}{P\left( {T_{w} \geq t_{wd}} \middle| \lambda_{w} \right)}}} & (15) \\ {t_{wd} = {x_{wd}{\log \left( {1 + {c\frac{avgl}{l_{d}}}} \right)}}} & (16) \\ {c = {e - 1}\quad \text{(i.e., default value of } c\text{)}} & (17) \\ {{P\left( {T_{w} \geq t_{wd}} \middle| \lambda_{w} \right)} = {\frac{\lambda_{w}}{\lambda_{w} + t_{wd}}\quad \text{(i.e., log-logistic distribution)}}} & (18) \end{matrix}$

The value c=e−1 can be seen as a default value of c, since with it the occurrences in a document of average length are not renormalized (i.e. the normalization factor log(1+c) is equal to 1 when l_(d)=avgl).
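The construction of the weighted term-document matrix can be sketched as follows, under assumptions that go beyond the text: the description does not say how λ_(w) is estimated, so this sketch sets λ_(w)=N_(w)/N (the fraction of documents containing w); document length is taken as the sum of the counts in the document; and all function names and the toy collection are hypothetical.

```python
import math


def eta_log(t: float, eta: float) -> float:
    """ln_eta(t) = (t**(1 - eta) - 1) / (1 - eta); natural log as eta -> 1."""
    if math.isclose(eta, 1.0):
        return math.log(t)
    return (t ** (1.0 - eta) - 1.0) / (1.0 - eta)


def weighted_term_document_matrix(counts, eta, c=math.e - 1.0):
    """counts: one non-empty dict per document mapping word -> raw count.

    Returns (vocabulary, matrix) where matrix[i][j] is the weight of
    vocabulary[i] in document j.
    """
    n_docs = len(counts)
    lengths = [sum(doc.values()) for doc in counts]
    avg_len = sum(lengths) / n_docs
    vocab = sorted({w for doc in counts for w in doc})
    # Assumption: lambda_w = N_w / N; the text does not fix this choice.
    lam = {w: sum(1 for doc in counts if w in doc) / n_docs for w in vocab}
    matrix = []
    for w in vocab:
        row = []
        for doc, l_d in zip(counts, lengths):
            t_wd = doc.get(w, 0) * math.log(1.0 + c * avg_len / l_d)
            row.append(-eta_log(lam[w] / (lam[w] + t_wd), eta))
        matrix.append(row)
    return vocab, matrix


# Hypothetical three-document collection, weighted with eta = 1.2.
docs = [{"svd": 2, "matrix": 1}, {"matrix": 3}, {"cluster": 1, "matrix": 1, "svd": 1}]
vocab, weights = weighted_term_document_matrix(docs, eta=1.2)
```

The resulting matrix could then be passed to a singular value decomposition routine, for example `numpy.linalg.svd`, to obtain the latent dimensions. In this toy collection, a single occurrence of the rare word "cluster" receives a larger weight than a single occurrence of the word "matrix", which appears in every document.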

FIG. 3 shows the Mean Average Precision for the dataset queries once the SVD has been performed for several latent dimensions. The figure shows that the generalized informative content with η=1.2 outperforms the two other baselines in most cases. More particularly, it shows a performance measure on the y-axis against a parameter of the SVD method, the number of latent dimensions. Performance improved with the subject embodiment.

Overall, this shows that the proposal is also adequate for computing similarities between documents as better performances are obtained. Many clustering algorithms rely on distance functions between documents. So, this suggests that the subject method would be beneficial for document clustering tasks.

The subject embodiments can also result in improved text classification. The table below illustrates an application to text classification. A logistic regression with L2 regularization is trained with different embeddings of term frequencies. For each dataset, the best value of η is kept among {0.5, 0.75, 1, 1.2, 1.5} and the performance is compared against several TF-IDF weightings and a standard log function. The default value c=e−1 is chosen, and the regularization parameter is chosen on the validation set among {0.0001, 0.01, 0.1, 1, 5, 10, 50, 100}. The table shows the mean accuracy for three standard test collections.

Weighting Model      φ(d)      20NewsGroup      Sector      WebKB
tfidf L2 norm      $\frac{x_{w}{{idf}(w)}}{\sqrt{\sum_{w}\left( {x_{w}{{idf}(w)}} \right)^{2}}}$      87.9      87.0      81.1
√tfidf L2 norm      $\frac{\sqrt{x_{w}{{idf}(w)}}}{\sqrt{\sum_{w}\left( {x_{w}{{idf}(w)}} \right)}}$      88.8      87.2      85.9
Bhattacharyya tf      $\sqrt{\frac{x_{w}}{\sum_{w}x_{w}}}$      86.0      84.4      88.6
log, i.e. η = 1      −log P(T_(w) = t_(w)|λ_(w))      88.4      87.9      87.2
Best η      −log_(η) P(T_(w) = t_(w)|λ_(w))      89.0      89.5      88.9

The result for the best η is shown in the bottom row.

With reference to FIG. 1, a flowchart showing steps of a preferred embodiment is illustrated. Initially, a probability distribution is associated 10 with each of a plurality of words in a document collection. A weighting model of word importance to the collection is then formed 12 from the probability distribution. A distribution of a selected word in a particular document in the collection is then determined 14. A deviation of the distribution of the selected word in the document from the probability distribution of the word in the collection is computed 16, wherein the deviation is weighted in accordance with a concavity function. Lastly, the greater the computed deviation, the greater the importance of the selected word that is assigned 18 to the informative content of the document.

It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims. 

What is claimed is:
 1. A processing system for representing documents relative to importance of words in the document, including: a processor comprising a weighting model of word importance in a document in a collection relative to an importance of the word in other documents in the collection by computing a deviation of distribution of the word from a probability distribution of the word in other documents in the collection, wherein the deviation distribution is weighted in accordance with a concavity function.
 2. The system of claim 1 wherein the concavity function comprises applying a concavity control parameter to a weighting function in the computing of the deviation of distribution.
 3. The system of claim 2 wherein the concavity control parameter is selectively variable.
 4. The system of claim 3 wherein the variability of the concavity control parameter is effected relative to a probable frequency of the word in the document.
 5. The system of claim 2 wherein the weighting model is represented as: RSV(q, d) = Σ_(w∈q∩d) −q_(w) ln_(η) P(T_(w) ≧ t_(wd)|λ_(w)) wherein parameter c normalizes word frequencies for determining the importance of the word, and parameter η sets curvature of the logarithm.
 6. The system of claim 2 wherein the concavity control parameter is varied in accordance with predetermined standards to better effect text classification or latent semantic indexing.
 7. A method for document representation by measuring informative content of words, comprising: associating a probability distribution for a plurality of words in a document collection; forming a weighting model of word importance to the collection from the probability distribution therein; determining a distribution of a selected word in a particular document in the collection; computing a deviation of distribution of the selected word in the document from the probability distribution of the word in the collection in accordance with a concavity function; and assigning a greater importance of the selected word to the informative content of the document the greater the computed deviation in accordance with a concavity function, whereby the document can be represented relative to the collection by the assigned importance of a plurality of selected words.
 8. The method of claim 7 wherein the computing in accordance with the concavity function includes applying a concavity control parameter to a weighting function in the computing of the deviation of distribution.
 9. The method of claim 7 wherein the concavity control parameter is selectively variable.
 10. The method of claim 9 wherein the variability of the concavity control parameter is effected relative to a probable frequency of the word in the document.
 11. The method of claim 8 wherein the computing is effected in accordance with: RSV(q, d) = Σ_(w∈q∩d) −q_(w) ln_(η) P(T_(w) ≧ t_(wd)|λ_(w)) wherein parameter c normalizes word frequencies for determining the importance of the word, and parameter η sets curvature of the logarithm.
 12. The method of claim 8 wherein the concavity control parameter is varied in accordance with predetermined standards to better effect text classification or latent semantic indexing.